Project: Investigate a Dataset - Medical Appointment No Shows¶

Table of Contents¶

  • 1. Introduction
  • 2. Data Wrangling
  • 3. Exploratory Data Analysis
  • 4. Conclusions
  • 5. References

1. Introduction¶

This analysis is part of the Udacity Data Analysis Nanodegree Program and explores a dataset of medical appointment no-shows in Brazil. The analysis is divided into four main parts:

  1. Introduction, where the initial information is provided and the problem is set.
  2. Data Wrangling, where the data is cleaned and prepared for analysis.
  3. Exploratory Data Analysis, where key patterns are identified.
  4. Conclusions, where the insights are summarized.

Dataset Description¶

This dataset contains over 100K medical appointments made in Brazil. The information concerns the Brazilian public health system, known as SUS (Sistema Único de Saúde, Unified Health System), one of the largest health systems in the world, entirely free of charge and used by over 220 million Brazilians.

The dataset has 14 columns:

  1. PatientId, unique identifier of a patient.
  2. AppointmentID, unique identifier of each appointment.
  3. Gender, "M" for Male and "F" for Female.
  4. ScheduledDay, the day the patient set up the appointment.
  5. AppointmentDay, the day of the actual appointment.
  6. Age, age in years.
  7. Neighbourhood, the location of the appointment.
  8. Scholarship, whether the patient is enrolled in Bolsa Família.
  9. Hipertension, True or False.
  10. Diabetes, True or False.
  11. Alcoholism, True or False.
  12. Handcap, True or False.
  13. SMS_received, True or False.
  14. No-show, categorical, whether the appointment was a no-show.

Question(s) for Analysis¶

  1. What is the mean time between the scheduled day and the appointment day?
    • Is there a direct relation between this time difference and the no-shows?
  2. Is SMS a key factor in reducing no-shows?
  3. Regarding location, which neighbourhood has the highest no-show rate?

Data Wrangling¶

In this section of the report, the data will be loaded, checked for cleanliness, and cleaned as necessary.

In [199]:
# import necessary packages for analysis and data visualization
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'notebook'
In [200]:
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()
Out[200]:
PatientId AppointmentID Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
0 2.987250e+13 5642903 F 2016-04-29T18:38:08Z 2016-04-29T00:00:00Z 62 JARDIM DA PENHA 0 1 0 0 0 0 No
1 5.589978e+14 5642503 M 2016-04-29T16:08:27Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 0 0 0 0 0 No
2 4.262962e+12 5642549 F 2016-04-29T16:19:04Z 2016-04-29T00:00:00Z 62 MATA DA PRAIA 0 0 0 0 0 0 No
3 8.679512e+11 5642828 F 2016-04-29T17:29:31Z 2016-04-29T00:00:00Z 8 PONTAL DE CAMBURI 0 0 0 0 0 0 No
4 8.841186e+12 5642494 F 2016-04-29T16:07:23Z 2016-04-29T00:00:00Z 56 JARDIM DA PENHA 0 1 1 0 0 0 No
In [201]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
In [202]:
df.duplicated().value_counts()
Out[202]:
False    110527
dtype: int64

As the outputs above show, there are no missing values and no duplicated rows.
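The same checks can be made explicit in one cell; a minimal sketch on a toy frame (the notebook would pass df instead of the illustrative toy below):

```python
import pandas as pd

# Illustrative stand-in for the notebook's df
toy = pd.DataFrame({'Age': [62, 56, 8], 'Gender': ['F', 'M', 'F']})

# Total missing values and total duplicated rows
missing_total = toy.isnull().sum().sum()
dup_total = toy.duplicated().sum()
print(missing_total, dup_total)  # 0 0
```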

In [203]:
# Change column names to lower case
df.columns = df.columns.str.lower()

# Rename misspelled column names
df.rename(columns={'hipertension': 'hypertension',
                   'handcap': 'handicap',
                   'no-show': 'noshow'},
          inplace=True)

Categorical Data¶

PatientId¶

The column is loaded as float64, but it represents the unique identifier of the patient, so there is no reason to apply numerical operations to it.

In [204]:
# Change the dtype of PatientId to string
# The PatientId is changed to int first, just to remove the decimals and have a cleaner look
df['patientid'] = df['patientid'].astype('int64').astype(str)
In [205]:
# Check the number of unique patient IDs
df.patientid.nunique()
Out[205]:
62299

Since the number of unique patients is lower than the number of appointments, some patients made more than one appointment.
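Repeat patients can be quantified with value_counts — a sketch with hypothetical IDs (the notebook would use df.patientid):

```python
import pandas as pd

# Hypothetical patient IDs for illustration
ids = pd.Series(['a1', 'a1', 'b2', 'c3', 'c3', 'c3'])

counts = ids.value_counts()
repeat_patients = (counts > 1).sum()     # patients with more than one appointment
extra_appointments = (counts - 1).sum()  # appointments beyond each patient's first
print(repeat_patients, extra_appointments)  # 2 3
```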

Appointment ID¶

As with PatientId, this column should be stored as a string.

In [206]:
# Change the dtype of AppointmentID
df['appointmentid'] = df['appointmentid'].astype(str)
In [207]:
df['appointmentid'].nunique()
Out[207]:
110527

This column contains only unique values generated by a system to uniquely identify appointments. For the analysis, it will be used as the index, giving the index more clarity and meaning.

In [208]:
df.set_index('appointmentid', drop=True, inplace=True)

Gender and No-Show¶

Both will be checked for inconsistencies and have their datatype changed to categorical for memory efficiency.

In [209]:
# Check the distinct values
df.gender.unique()
Out[209]:
array(['F', 'M'], dtype=object)
In [210]:
# Changing the dtype to category
df['gender'] = df['gender'].astype('category')
In [211]:
df.noshow.unique()
Out[211]:
array(['No', 'Yes'], dtype=object)
In [212]:
# Changing the dtype to category
df['noshow'] = df['noshow'].astype('category')

Neighbourhood¶

In [213]:
df.neighbourhood.unique()
Out[213]:
array(['JARDIM DA PENHA', 'MATA DA PRAIA', 'PONTAL DE CAMBURI',
       'REPÚBLICA', 'GOIABEIRAS', 'ANDORINHAS', 'CONQUISTA',
       'NOVA PALESTINA', 'DA PENHA', 'TABUAZEIRO', 'BENTO FERREIRA',
       'SÃO PEDRO', 'SANTA MARTHA', 'SÃO CRISTÓVÃO', 'MARUÍPE',
       'GRANDE VITÓRIA', 'SÃO BENEDITO', 'ILHA DAS CAIEIRAS',
       'SANTO ANDRÉ', 'SOLON BORGES', 'BONFIM', 'JARDIM CAMBURI',
       'MARIA ORTIZ', 'JABOUR', 'ANTÔNIO HONÓRIO', 'RESISTÊNCIA',
       'ILHA DE SANTA MARIA', 'JUCUTUQUARA', 'MONTE BELO',
       'MÁRIO CYPRESTE', 'SANTO ANTÔNIO', 'BELA VISTA', 'PRAIA DO SUÁ',
       'SANTA HELENA', 'ITARARÉ', 'INHANGUETÁ', 'UNIVERSITÁRIO',
       'SÃO JOSÉ', 'REDENÇÃO', 'SANTA CLARA', 'CENTRO', 'PARQUE MOSCOSO',
       'DO MOSCOSO', 'SANTOS DUMONT', 'CARATOÍRA', 'ARIOVALDO FAVALESSA',
       'ILHA DO FRADE', 'GURIGICA', 'JOANA D´ARC', 'CONSOLAÇÃO',
       'PRAIA DO CANTO', 'BOA VISTA', 'MORADA DE CAMBURI', 'SANTA LUÍZA',
       'SANTA LÚCIA', 'BARRO VERMELHO', 'ESTRELINHA', 'FORTE SÃO JOÃO',
       'FONTE GRANDE', 'ENSEADA DO SUÁ', 'SANTOS REIS', 'PIEDADE',
       'JESUS DE NAZARETH', 'SANTA TEREZA', 'CRUZAMENTO',
       'ILHA DO PRÍNCIPE', 'ROMÃO', 'COMDUSA', 'SANTA CECÍLIA',
       'VILA RUBIM', 'DE LOURDES', 'DO QUADRO', 'DO CABRAL', 'HORTO',
       'SEGURANÇA DO LAR', 'ILHA DO BOI', 'FRADINHOS', 'NAZARETH',
       'AEROPORTO', 'ILHAS OCEÂNICAS DE TRINDADE', 'PARQUE INDUSTRIAL'],
      dtype=object)
In [214]:
df.neighbourhood.nunique()
Out[214]:
81

The Neighbourhood column has 81 unique values, and there is no need to change anything.

Datetime data¶

Both AppointmentDay and ScheduledDay should be in datetime format.

In [215]:
# Change both columns to datetime and normalize them to keep just the date information
df['appointmentday']= pd.to_datetime(df['appointmentday']).dt.normalize()
df['scheduledday'] = pd.to_datetime(df['scheduledday']).dt.normalize()

# Create a month column for easy data manipulation
df['appointmentday_month'] = df['appointmentday'].dt.month_name()
df['scheduledday_month'] = df['scheduledday'].dt.month_name()

It can be presumed that every ScheduledDay comes before its AppointmentDay.

To check this, a new column will be created and the data filtered on it.

In [216]:
# Create a new column 'BetweenDays'
df['betweendays'] = df['appointmentday'] - df['scheduledday']

# Create a column with the value as a float for easier manipulation
df['betweendays_float'] = df['betweendays'].dt.total_seconds() / (24 * 60 * 60)

# Check for possible inconsistencies
df['betweendays'].describe()
Out[216]:
count                        110527
mean     10 days 04:24:31.828602965
std      15 days 06:07:11.673762786
min               -6 days +00:00:00
25%                 0 days 00:00:00
50%                 4 days 00:00:00
75%                15 days 00:00:00
max               179 days 00:00:00
Name: betweendays, dtype: object
In [217]:
# Check how many rows have inconsistent data
df[df['betweendays'].dt.days < 0].shape
Out[217]:
(5, 17)

Since there are only 5 rows with invalid data, they will be dropped from the DataFrame.

In [218]:
df.drop(
    df[df['betweendays'].dt.days < 0].index,
    inplace=True
)

Let's transform the float values in betweendays_float into a categorical type. This will help further analysis by grouping the data.

In [219]:
def categorical_betweendays(row):
    if row['betweendays_float'] == 0:
        return '0 Days'
    elif row['betweendays_float'] <=3:
        return '1-3 Days'
    elif row['betweendays_float'] <=7:
        return '4-7 Days'
    elif row['betweendays_float'] <=14:
        return '8-14 Days'
    elif row['betweendays_float'] <=30:
        return '15-30 Days'
    else:
        return '30+ Days'
In [220]:
df['betweendays_cate'] = df.apply(categorical_betweendays, axis=1)
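As a design note, the same binning could be done without a row-wise apply by using pd.cut — a vectorized sketch with the same right-closed bin edges as categorical_betweendays (illustrative values only):

```python
import pandas as pd

days = pd.Series([0.0, 2.0, 5.0, 10.0, 20.0, 45.0])

# Right-closed intervals mirror the <= comparisons in categorical_betweendays
labels = ['0 Days', '1-3 Days', '4-7 Days', '8-14 Days', '15-30 Days', '30+ Days']
cats = pd.cut(days, bins=[-0.5, 0, 3, 7, 14, 30, float('inf')], labels=labels)
print(list(cats))  # ['0 Days', '1-3 Days', '4-7 Days', '8-14 Days', '15-30 Days', '30+ Days']
```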

Numeric data¶

For the numeric data, the columns 'scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap' and 'sms_received' are expected to contain only zeros and ones.

In [221]:
df[['scholarship', 'hypertension', 'diabetes', 'alcoholism', 'handicap', 'sms_received']].describe()
Out[221]:
scholarship hypertension diabetes alcoholism handicap sms_received
count 110522.000000 110522.000000 110522.000000 110522.000000 110522.000000 110522.000000
mean 0.098270 0.197255 0.071868 0.030401 0.022231 0.321040
std 0.297681 0.397928 0.258270 0.171689 0.161493 0.466878
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 1.000000 1.000000 1.000000 1.000000 4.000000 1.000000

As shown above, all columns contain only zeros and ones, except handicap, whose maximum value is four.

As stated in this discussion on Kaggle, the value means the number of disabilities a person has. For example, a person who is blind and has difficulty walking would have the value 2.

With this in mind, we will check the number of such values and leave the column as it is.

In [222]:
print(
    f"""
Number of Appointments where Handicap > 1: {df[df.handicap > 1].shape[0]}
Percentage: {df[df.handicap > 1].shape[0] / df.shape[0] * 100:.3f} % 
    """
)
Number of Appointments where Handicap > 1: 199
Percentage: 0.180 % 
    

Only approximately 0.2% of the appointments involve more than one disability.

Age column¶

For the Age column, we need to check if there are any inconsistent values.

In [223]:
df.age.describe()
Out[223]:
count    110522.000000
mean         37.089041
std          23.110064
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: age, dtype: float64

We see outliers, as there are negative values and values over 100.

The negative value has no meaning, so it will be dropped.

As for the values over 100, although hard to believe, we will check the number of cases and see how they impact the data.

In [224]:
# Check for ages below 0
df[df.age < 0]
Out[224]:
patientid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism handicap sms_received noshow appointmentday_month scheduledday_month betweendays betweendays_float betweendays_cate
appointmentid
5775010 465943158731293 F 2016-06-06 00:00:00+00:00 2016-06-06 00:00:00+00:00 -1 ROMÃO 0 0 0 0 0 0 No June June 0 days 0.0 0 Days
In [225]:
# Drop that row
df.drop(
    df[df.age < 0].index,
    inplace= True
)
In [226]:
df[df.age > 100]
Out[226]:
patientid gender scheduledday appointmentday age neighbourhood scholarship hypertension diabetes alcoholism handicap sms_received noshow appointmentday_month scheduledday_month betweendays betweendays_float betweendays_cate
appointmentid
5651757 976294799775439 F 2016-05-03 00:00:00+00:00 2016-05-03 00:00:00+00:00 102 CONQUISTA 0 0 0 0 0 0 No May May 0 days 0.0 0 Days
5700278 31963211613981 F 2016-05-16 00:00:00+00:00 2016-05-19 00:00:00+00:00 115 ANDORINHAS 0 0 0 0 1 0 Yes May May 3 days 3.0 4-7 Days
5700279 31963211613981 F 2016-05-16 00:00:00+00:00 2016-05-19 00:00:00+00:00 115 ANDORINHAS 0 0 0 0 1 0 Yes May May 3 days 3.0 4-7 Days
5562812 31963211613981 F 2016-04-08 00:00:00+00:00 2016-05-16 00:00:00+00:00 115 ANDORINHAS 0 0 0 0 1 0 Yes May April 38 days 38.0 30+ Days
5744037 31963211613981 F 2016-05-30 00:00:00+00:00 2016-05-30 00:00:00+00:00 115 ANDORINHAS 0 0 0 0 1 0 No May May 0 days 0.0 0 Days
5751563 234283596548 F 2016-05-31 00:00:00+00:00 2016-06-02 00:00:00+00:00 102 MARIA ORTIZ 0 0 0 0 0 0 No June May 2 days 2.0 1-3 Days
5717451 748234579244724 F 2016-05-19 00:00:00+00:00 2016-06-03 00:00:00+00:00 115 SÃO JOSÉ 0 1 0 0 0 1 No June May 15 days 15.0 15-30 Days

For the data with age over 100, there are only 7 cases.

Looking at the PatientId, 4 of those appointments were made by the same patient in the span of 3 months.

Exploratory Data Analysis¶

To prepare for the further analysis, helper functions will be defined to explore the columns.

In [227]:
def get_statistcs(df, col_name, height=400, width=400):
    """
    Loads resume statics for certain colunm in a pandas dataframe. Also prints a basic violin plot.
        
    Args:
        df_col: Pandas DataFrame. 
        col_name: str, column's name. Ntype numeric only.
        height: int, graph's height. Default = 400
        width: int, graph's width. Default = 400
    Return:
        Statistics, violin plot
    """
    print(f"""
    Mean: {df[col_name].mean():.3f} 
    Std deviation: {np.std(df[col_name]):.3f} 
    Max value: {df[col_name].max()}
    Min value: {df[col_name].min()}
    Median: {df[col_name].median()}

    Number of unique values: {df[col_name].nunique()}
""")
    
    fig = px.violin(df, y=col_name, box=True, height=height, width=width, points='outliers')
    fig.show()

1. What is the mean time between the scheduled day and the appointment day?¶

In [228]:
get_statistcs(df, 'betweendays_float')
    Mean: 10.184 
    Std deviation: 15.255 
    Max value: 179.00000000000003
    Min value: 0.0
    Median: 4.0

    Number of unique values: 129

In [229]:
# Find the betweendays mean
print('---')
print(f'Mean time between schedule and appointment:\n{df.betweendays.mean()}')
print('---')

# Plot a graph of this mean over for each month
df_fig = pd.DataFrame(df.groupby('appointmentday_month').betweendays_float.mean()).reset_index()

fig1 = px.bar(
    df_fig,
    x = 'appointmentday_month',
    y = 'betweendays_float',
    height= 400,
    width= 500,
    title='Mean days between Schedule and Appointment',
    text_auto='.2f'
)
fig1.show()
---
Mean time between schedule and appointment:
10 days 04:25:27.412889858
---

As the graph shows, the mean time between scheduling and appointment is about 10 days, with a standard deviation of about 15 days, meaning the data is widely dispersed. This also suggests that the data does not follow a normal distribution.

There is no significant variation across the months.

1.1. Is there a direct relation between this time difference and the no-shows?¶

In [230]:
# Group data by noshow
df_fig = df.groupby('noshow')['betweendays_float'].mean().reset_index()

fig = px.bar(
    df_fig,
    x = 'noshow',
    y = 'betweendays_float',
    height= 400,
    width= 500,
    title='Mean days between by noshow',
    text_auto='.2f'
)
fig.show()

As the graph above suggests, there is a relation between no-shows and the number of days between scheduling and appointment.

In [231]:
# Group data by noshow
df_fig = df.groupby(['betweendays_cate','noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index='betweendays_cate',
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
df_fig.sort_values(by='noshow_rate', inplace=True)

fig = px.bar(
    df_fig,
    x = 'betweendays_cate',
    y = 'noshow_rate',
    # color = 'noshow',
    # barmode='group',
    height= 400,
    width= 500,
    title='Noshow rate over between days',
    text_auto='.2f'
)
fig.show()

From the graph, the no-show rate increases with the time between scheduling and the appointment day. The lowest value occurs when the appointment happens on the same day, where the no-show rate is below 5%; in that case, no-shows may be associated with emergencies.

2. Is SMS a key factor to reduce no-shows?¶

In [232]:
get_statistcs(df, 'sms_received')
    Mean: 0.321 
    Std deviation: 0.467 
    Max value: 1
    Min value: 0
    Median: 0.0

    Number of unique values: 2

The plot above is not a meaningful visualization, since the SMS data is effectively categorical. The relevant value is the mean, which gives the frequency of receiving an SMS: around 30%.

It is therefore interesting to check whether there is any relationship with location, as SMS may not be sent in some neighbourhoods.
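As a quick sanity check on reading the mean as a frequency: for a 0/1 column, the mean is exactly the share of ones (toy example):

```python
import pandas as pd

# For a 0/1 indicator, the mean equals the proportion of 1s
sms = pd.Series([0, 1, 0, 0, 1])
print(sms.mean())  # 0.4
```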

In [233]:
df_fig = df.groupby(['neighbourhood', 'sms_received']).patientid.count().reset_index()

df_fig1 = df_fig.pivot(
    index='neighbourhood',
    columns='sms_received',
    values='patientid'
).reset_index()

df_fig1['sms_received_rate'] =( df_fig1[1] / (df_fig1[1] + df_fig1[0])) * 100


fig = px.bar(
    df_fig1,
    x = 'neighbourhood',
    y = 'sms_received_rate',
    height= 400,
    width= 800,
    title='Noshow rate if sms_received',
    text_auto='.1f'
)
fig.show()
In [234]:
get_statistcs(df_fig1, 'sms_received_rate')
    Mean: 33.081 
    Std deviation: 6.388 
    Max value: 60.0
    Min value: 12.5
    Median: 33.33333333333333

    Number of unique values: 79

In [235]:
df_fig1.loc[df_fig1['neighbourhood'].isin(['ILHAS OCEÂNICAS DE TRINDADE', 'PARQUE INDUSTRIAL'])]
Out[235]:
sms_received neighbourhood 0 1 sms_received_rate
34 ILHAS OCEÂNICAS DE TRINDADE 2.0 NaN NaN
51 PARQUE INDUSTRIAL 1.0 NaN NaN

The analysis and graphs above show two outliers with an undefined (NaN) sms_received_rate. This happens because there is not enough data for these neighbourhoods: together they have only 3 entries in the original DataFrame.

Despite this, the proportions across all neighbourhoods follow a distribution close to normal, with a mean of 33% and a standard deviation of 6%.

There is no big difference between the neighbourhood distributions, and apart from the outliers there is no impediment to continuing the analysis.
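One way to avoid the undefined rates from the pivot would be to fill the missing combination with zero before computing the rate — a sketch on a toy pivot mimicking the outlier neighbourhoods (not the notebook's actual frame):

```python
import pandas as pd

# Toy pivot: neighbourhood 'A' has no sms_received == 1 rows, like the outliers
piv = pd.DataFrame({0: [2.0, 10.0], 1: [float('nan'), 5.0]}, index=['A', 'B'])

piv = piv.fillna(0)  # missing combination -> zero appointments with SMS
rate = piv[1] / (piv[0] + piv[1]) * 100
print(rate.round(2).tolist())  # [0.0, 33.33]
```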

Now let's look at the no-show rate in relation to sms_received.

In [236]:
df_fig = df.groupby(['sms_received', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index='sms_received',
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100

fig = px.bar(
    df_fig,
    x = 'sms_received',
    y = 'noshow_rate',
    height= 400,
    width= 500,
    title='Noshow rate if sms_received',
    text_auto='.1f'
)
fig.show()

The previous result differs from what was expected: receiving an SMS reminder should be a mechanism to avoid no-shows, yet the highest no-show rate appears when an SMS was received.

Now let's segment this result by the days between scheduling and appointment to investigate further.

In [237]:
df_fig = df.groupby(['betweendays_cate','sms_received', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index=['betweendays_cate','sms_received'],
    columns='noshow',
    values='patientid'
).reset_index()

df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100
df_fig['sms_received'] = df_fig['sms_received'].astype(str)

fig2 = px.bar(
    df_fig,
    x = 'betweendays_cate',
    y = 'noshow_rate',
    color='sms_received',
    barmode='group',
    title='No show rate by sms_received over days between',
    text_auto=True
)

fig2.show()

When segmenting the data by the between-days categories, the result is as expected.

The previous value was probably biased because no SMS is sent for same-day appointments or for appointments 1 to 3 days ahead.

With the analysis above, it can be seen that SMS contributes to reducing the no-show rate.
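The claim that no SMS is sent for short lead times could be checked with a crosstab — a sketch on hypothetical rows (the notebook would pass its own betweendays_cate and sms_received columns):

```python
import pandas as pd

# Hypothetical rows mimicking the pattern: no SMS for same-day appointments
toy = pd.DataFrame({
    'betweendays_cate': ['0 Days', '0 Days', '4-7 Days', '4-7 Days'],
    'sms_received': [0, 0, 0, 1],
})

# Rows: lead-time category; columns: sms_received; cells: appointment counts
tab = pd.crosstab(toy['betweendays_cate'], toy['sms_received'])
print(tab)
```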

3. Regarding location, which neighbourhood has the highest no-show rate?¶

In [238]:
get_statistcs(df_fig, 'noshow_rate')
    Mean: 27.401 
    Std deviation: 8.945 
    Max value: 37.525201612903224
    Min value: 4.647061874384108
    Median: 28.956443683457362

    Number of unique values: 10

In [239]:
# Group data by noshow
df_fig = df.groupby(['neighbourhood', 'noshow']).patientid.count().reset_index()

df_fig = df_fig.pivot(
    index='neighbourhood',
    columns='noshow',
    values='patientid'
).reset_index()

# Filter out neighbourhoods with 10 or fewer appointments (non-matching rows become NaN)
df_fig = df_fig.where(df_fig['Yes'] + df_fig['No'] > 10)

# Create noshow_rate column
df_fig['noshow_rate'] =( df_fig['Yes'] / (df_fig['No'] + df_fig['Yes'])) * 100

# Sort values and get the top 10
df_fig1 = df_fig.sort_values(by='noshow_rate', ascending=False).head(10)

fig3 = px.bar(
    df_fig1,
    x = 'neighbourhood',
    y = 'noshow_rate',
    height= 400,
    width= 500,
    title='Top 10 neighbourhood by noshow rate',
    text_auto='.1f'
)
fig3.show()
In [240]:
df_fig2 = df_fig.sort_values(by='noshow_rate', ascending=False).tail(10)

fig = px.bar(
    df_fig2,
    x = 'neighbourhood',
    y = 'noshow_rate',
    height= 400,
    width= 500,
    title='Bottom 10 neighbourhood by noshow rate',
    text_auto='.1f'
)
fig.show()
In [241]:
get_statistcs(df_fig, 'noshow_rate')
    Mean: 19.887 
    Std deviation: 3.080 
    Max value: 28.91849529780564
    Min value: 8.571428571428571
    Median: 19.758848697005057

    Number of unique values: 77

When analyzing the no-show rate by neighbourhood, some regions show a higher rate than others; however, the statistics and the violin plot show little dispersion, with a distribution close to normal.


Conclusions¶

This analysis sought to explore data from medical appointments made through the Unified Health System (SUS) in Brazil. The focus of the analysis is understanding the reasons why around 20% of appointments end in a no-show. This high volume of no-shows reveals inefficiency both in the scheduling system and in operations, and maintaining these indicators represents a significant financial cost.

The data was cleaned and explored as described above with the purpose of understanding the possible causes of no-shows, but statistically conclusive claims cannot be made. The dataset has several limitations in how its fields are documented, and a more in-depth study, taking other factors into account, would be needed to draw better conclusions and guide actions.

About analysis limitations¶

  • Concerning data scope

The data relates only to the municipality of Vitória (ES). As SUS is a nationwide system, analyzing data from just one region yields a biased and not very comprehensive analysis.

The data also covers a period of just 3 months. For a system of the magnitude of SUS, a larger time span would be more relevant for analyzing seasonality and changes over time.

  • Concerning data quality

Regarding data quality, a detailed explanation of the fields was not available, so in some scenarios conclusions had to be inferred. An example is the values greater than 1 in the handicap column, as explained previously.

The sms_received data also appears to be incomplete: there is no information about sending attempts and, based on the analysis, when the period between scheduling and appointment is short (up to 3 days), no SMS is sent or received.

About the questions made¶

1. What is the mean time between Scheduled day and the appointment day?¶

In [242]:
fig1.show()

As the graph above shows, the average is 10 days and 4 hours. This value does not change significantly over the months, and no reasons for the small variations were identified.

2. Is SMS a key factor in reducing no-shows?¶

In [243]:
fig2.show()

The graph above shows that yes, receiving an SMS is a significant factor in reducing no-shows. The visualization also shows that SMS are not sent when the gap between scheduling and appointment is short, as there is no data for those categories.

3. Regarding location, which neighbourhood has the highest no-show rate?¶

In [244]:
fig3.show()

As the graph above shows, the neighbourhoods with the highest no-show rates are:

  1. SANTOS DUMONT
  2. SANTA CECÍLIA
  3. SANTA CLARA
  4. ITARARÉ
  5. JESUS DE NAZARETH

References¶

  • Applying heatmaps for categorical data analysis
  • Predict Show/NoShow - EDA+Visualization+Model
  • Sistema Único de Saúde
  • Project data by Kaggle
In [245]:
# Running this cell will execute a bash command to convert this notebook to an .html file
# !python -m nbconvert --to html Investigate_a_Dataset.ipynb